Gene Signatures - How to score & interpret them

Author

Marc Elosua Bayes

Published

October 3, 2023

Introduction

Gene signatures are commonly used in routine single cell analysis. Many methods exists but they are not all created equally. In this tutorial we are going to go follow a recent benchmarking paper Badia-i-Mompel et al. (2022) and follow their guidelines on best practices when scoring gene signatures!

With this tutorial we hope to familiarize you with the concepts of gene signatures, how they are scored in single cell datasets and how to interpret the scores obtained!

Before we start here are some key concepts that will help us and frame the vignette!

  • What is a gene signature?

    A “gene signature” can be stated as a single or a group of genes in a cell having a unique pattern of gene expression that is the consequence of either changed biological process or altered pathogenic medical terms Mallik and Zhao (2018).

  • What is a cell type signature?

    A cell type signature is a gene signature representing a group of genes underlying the biological processes characteristic of a cell type.

  • How do we score them in our dataset?

    Scoring a gene signature means to obtain a value for that signature for each cell in our datasets that represents how active the gene program is in each cell. There are many ways to score gene signatures as shown in the decoupleR paper Badia-i-Mompel et al. (2022). However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don’t worry, we’ll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.

  • How do we interpret that score?

    Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).

  • Can we interrogate the scores obtained?

    Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that score. Sometime with gene signatures containing 50 genes it could be that just a few genes are contributing to the signature. If we just stopped at the score we could be mislead into thinking that all of the genes making up the signature are important when it is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores but vastly different genes gene programs underlying them.

Libraries

Load the libraries and install the packages needed to run this notebook

library(tidyverse)
library(Seurat)
library(DT)
library(ggpmisc)
library(ggrepel)

# install tutorial data if it doesn't exist 
if (!require("pbmc3k.SeuratData", character.only = TRUE))
    install.packages(
        "https://seurat.nygenome.org/src/contrib/pbmc3k.SeuratData_3.0.0.tar.gz",
        repos = NULL,
        type = "source")

library(pbmc3k.SeuratData)
# install.packages("BiocManager")
# install the latest development version from GitHub
# BiocManager::install("saezlab/decoupleR")
library(decoupleR)

# BiocManager::install("ComplexHeatmap")
library(ComplexHeatmap)

# Remember to set a seed so the analysis is reproducible!
set.seed(687)

Load data

For this purpose we are going to use the PBMC dataset from 10X Genomics here from the pbmc3k.SeuratData package.

pbmc3k
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)

Analysis

Preprocessing

We will do a quick preprocessing of the data. 1) log-normalize, 2) identify highly variable genes, 3) scale their expression and 4) compute PCA on the scaled data.

pbmc3k <- pbmc3k %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
    ScaleData(verbose = FALSE) %>%
    RunPCA(verbose = FALSE)

Next we check the elbow plot to determine the number of PCs to use for the downstream analysis and then compute UMAP, K-nearest neighbor graph (KNN graph) and run Louvain clustering on it.

# Look at elbow plot to assess the number of PCs to use
ElbowPlot(pbmc3k)

We can see a clear elbow at 10 PCs, we’re going to extend it a bit more and use 15 PCs for the downstream analysis to make sure we are not loosing any biological signal

pbmc3k <- RunUMAP(pbmc3k, reduction = "pca", dims = 1:15, verbose = FALSE)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

Next we compute the K-nearest-neighbor graph

pbmc3k <- pbmc3k %>%
    FindNeighbors(verbose = FALSE) %>%
    FindClusters(resolution = c(0.05, 0.1, 0.125, 0.15, 0.2), verbose = FALSE)

# Visualize the clustering on the UMAP
DimPlot(
    pbmc3k,
    group.by = c(
        "RNA_snn_res.0.05", "RNA_snn_res.0.1", "RNA_snn_res.0.125",
        "RNA_snn_res.0.15", "RNA_snn_res.0.2"))

For the purpose of this tutorial we are going to proceed with resolution 0.15

Idents(pbmc3k) <- pbmc3k$RNA_snn_res.0.15
(dim_plt <- DimPlot(
    pbmc3k,
    group.by = "RNA_snn_res.0.15") +
    scale_color_brewer(palette = "Dark2"))

Gene Signature Scoring

Here we define some gene signatures based on prior knowledge. We are setting gene signature that are characteristic for specific cell types to score for their activities in our dataset.

bcell <- c("MS4A1", "CD79A", "CD79B", "BANK1", "HLA-DQB1", "HLA-DQA1")
tcell <- c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B")
tnaive <- c(tcell, "IL7R", "CCR7", "TCF7", "LEF1", "SELL")
cd8cyto <- c(tcell, "GZMA", "GZMK", "NKG7", "CCL5")
mono <- c("FCGR3A", "CD14", "S100A9", "S100A8", "MS4A7")
nks <- c("NCR1", "NCR2", "NCR3", "FCGR3A", "GZMA", "GZMK", "NKG7", "CCL5")

We can see how there are some genes that are specific for each signature but others are shared between them. This is important to take into account when computing the gene signatures and interpreting their scores.

To help us compute these gene signatures we are going to use the R package decoupleR from Bioconductor. decoupleR is a great for carrying out these analysis since it is a framework that contains different statistical methods to compute these scores. Ultimately we will obtain a score for each signature for each cell.

decoupleR requires the gene signatures to be passed as a dataframe so we are going to convert our gene signature vectors into a unified dataframe. mor stands for Mode Of Regulation, at the moment since we don’t have a score of how important that gene is for that signature we are going to weight them all equally with a value of 1.

sig_ls <- list("B cells" = bcell, "T cells" = tcell, "Naive T cells" = tnaive, "CD8 Cytotoxic" = cd8cyto, "Monocytes" = mono, "NKs" = nks)
sig_df <- lapply(names(sig_ls), function(i) {
    data.frame(
        signature = i,
        gene = sig_ls[[i]],
        mor = 1 
    )
}) %>% bind_rows()

sig_df
       signature     gene mor
1        B cells    MS4A1   1
2        B cells    CD79A   1
3        B cells    CD79B   1
4        B cells    BANK1   1
5        B cells HLA-DQB1   1
6        B cells HLA-DQA1   1
7        T cells     CD3D   1
8        T cells     CD3E   1
9        T cells     TRAC   1
10       T cells    TRBC1   1
11       T cells    TRBC2   1
12       T cells      CD4   1
13       T cells     CD8A   1
14       T cells     CD8B   1
15 Naive T cells     CD3D   1
16 Naive T cells     CD3E   1
17 Naive T cells     TRAC   1
18 Naive T cells    TRBC1   1
19 Naive T cells    TRBC2   1
20 Naive T cells      CD4   1
21 Naive T cells     CD8A   1
22 Naive T cells     CD8B   1
23 Naive T cells     IL7R   1
24 Naive T cells     CCR7   1
25 Naive T cells     TCF7   1
26 Naive T cells     LEF1   1
27 Naive T cells     SELL   1
28 CD8 Cytotoxic     CD3D   1
29 CD8 Cytotoxic     CD3E   1
30 CD8 Cytotoxic     TRAC   1
31 CD8 Cytotoxic    TRBC1   1
32 CD8 Cytotoxic    TRBC2   1
33 CD8 Cytotoxic      CD4   1
34 CD8 Cytotoxic     CD8A   1
35 CD8 Cytotoxic     CD8B   1
36 CD8 Cytotoxic     GZMA   1
37 CD8 Cytotoxic     GZMK   1
38 CD8 Cytotoxic     NKG7   1
39 CD8 Cytotoxic     CCL5   1
40     Monocytes   FCGR3A   1
41     Monocytes     CD14   1
42     Monocytes   S100A9   1
43     Monocytes   S100A8   1
44     Monocytes    MS4A7   1
45           NKs     NCR1   1
46           NKs     NCR2   1
47           NKs     NCR3   1
48           NKs   FCGR3A   1
49           NKs     GZMA   1
50           NKs     GZMK   1
51           NKs     NKG7   1
52           NKs     CCL5   1

ULM

“Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator.”

Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!

So lets compute the signature scores for every cell in our dataset!

ulm_start <- Sys.time()
res <- decoupleR::run_ulm(
    mat = pbmc3k@assays$RNA@data,
    network = sig_df,
    .source = "signature",
    .target = "gene",
    .mor = "mor")

glue::glue("Time to run ulm is {round(difftime(Sys.time(), ulm_start, units = 's'), 0)} seconds")
Time to run ulm is 1 seconds

We can see how every cells has a score for every signature!

# Looking at the first 10 entries
res
# A tibble: 16,200 × 5
   statistic source  condition       score  p_value
   <chr>     <chr>   <chr>           <dbl>    <dbl>
 1 ulm       B cells AAACATACAACCAC -0.556 5.78e- 1
 2 ulm       B cells AAACATTGAGCTAC  9.47  3.13e-21
 3 ulm       B cells AAACATTGATCAGC -0.678 4.98e- 1
 4 ulm       B cells AAACCGTGCTTCCG  2.17  2.97e- 2
 5 ulm       B cells AAACCGTGTATGCG -0.475 6.35e- 1
 6 ulm       B cells AAACGCACTGGTAC  2.61  9.18e- 3
 7 ulm       B cells AAACGCTGACCAGT -0.563 5.73e- 1
 8 ulm       B cells AAACGCTGGTTCTT -0.565 5.72e- 1
 9 ulm       B cells AAACGCTGTAGCCA  1.27  2.03e- 1
10 ulm       B cells AAACGCTGTTTCTG  1.29  1.95e- 1
# ℹ 16,190 more rows
# Looking at all the scores for one specific cell
res %>% filter(condition == "AAACATACAACCAC")
# A tibble: 6 × 5
  statistic source        condition       score  p_value
  <chr>     <chr>         <chr>           <dbl>    <dbl>
1 ulm       B cells       AAACATACAACCAC -0.556 5.78e- 1
2 ulm       T cells       AAACATACAACCAC  7.27  3.75e-13
3 ulm       Naive T cells AAACATACAACCAC  7.38  1.63e-13
4 ulm       CD8 Cytotoxic AAACATACAACCAC  7.92  2.55e-15
5 ulm       Monocytes     AAACATACAACCAC -0.508 6.12e- 1
6 ulm       NKs           AAACATACAACCAC  2.57  1.02e- 2

How does a univariate linear model work?

Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:

# Define vectors of interest
vec1 <- c(1, 2, 5)
vec2 <- c(2.1, 3.8, 9.7)

# Run the linear model
summary(lm(vec2 ~ vec1))

Call:
lm(formula = vec2 ~ vec1)

Residuals:
       1        2        3 
 0.09231 -0.12308  0.03077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09231    0.16853   0.548   0.6810  
vec1         1.91538    0.05329  35.940   0.0177 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1569 on 1 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9985 
F-statistic:  1292 on 1 and 1 DF,  p-value: 0.01771
# Visualize the data
(p <- ggplot(mapping = aes(x = vec1, y = vec2)) +
    geom_point() +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    coord_fixed() +
    xlim(0, 10) +
    ylim(0, 10) +
    theme_minimal())

# now we can add the slope of the line that best fits our data and the T value
p +
    stat_poly_line(formula = y ~ x, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

In the example above we see the linear relationship between both vectors and we get the slope and the T value:
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).

- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the \(\frac{coefficient}{standard~error}\).

Now lets look at a “real world” example, we want to score the B cell signature in one cell. First we are going to start by visualizing the relationship between the weights and the gene expression for 2 cells of interest, one is a B cell and the other is not.

We need to do a bit of data prep but bear with me!

# We have our gene expression matrix
mat <- as.matrix(pbmc3k@assays$RNA@data)

# We want to obtain a matrix with 1s and 0s indicating the weight each gene has for each signature
## Initialize mor_mat with all 0s
sources <- unique(sig_df$signature)
targets <- rownames(mat)
mor_mat <- matrix(0, ncol = length(sources), nrow = nrow(mat))
colnames(mor_mat) <- sources
rownames(mor_mat) <- targets
weights <- sig_df$mor

# Fill in the matrix with the weights in the right places
for (i in 1:nrow(sig_df)) {
    .source <- sig_df$signature[[i]]
    .target <- sig_df$gene[[i]]
    .weight <- weights[[i]]
    if (.target %in% targets) {
        mor_mat[[.target, .source]] <- .weight
    }
}
# labels for geom_text_repel
repel_text <- rownames(mat)
keep <- which(rownames(mat) %in% bcell)
# Set non-selected positions to NA
repel_text[-keep] <- NA

# Visualize the data
ggplot(mapping = aes(x = mat[, "TTTGCATGAGAGGC"], y = mor_mat[, "B cells", drop = FALSE])) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = "Cell TTTGCATGAGAGGC - B cell") +
    coord_fixed() +
    xlim(0, 10) +
    ylim(0, 5) +
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

# Visualize the data
ggplot(mapping = aes(x = mat[, "AAGATGGAGATAAG"], y = mor_mat[, "B cells", drop = FALSE])) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = "Cell AAGATGGAGATAAG - Not a B cell") +
    coord_fixed() +
    xlim(0, 7.5) +
    ylim(0, 2.5) +
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

Next we are going to manually run the models for these two cells so that we can see that the results obtained from decoupleR make sense!

Check that the mor_mat has the weights in the right places

bcell
[1] "MS4A1"    "CD79A"    "CD79B"    "BANK1"    "HLA-DQB1" "HLA-DQA1"
mor_mat["MS4A1", ] # We can see how MS4A1 only has a weight in the B cell signature!
      B cells       T cells Naive T cells CD8 Cytotoxic     Monocytes 
            1             0             0             0             0 
          NKs 
            0 
mor_mat["CD3D", ] # We can see how CD3D has weights in all T cell signatures!
      B cells       T cells Naive T cells CD8 Cytotoxic     Monocytes 
            0             1             1             1             0 
          NKs 
            0 

Lets run the a linear model to score two cell for the B cell signature

mod1 <- lm(mat[, "TTTGCATGAGAGGC", drop = FALSE] ~ mor_mat[, "B cells", drop = FALSE])
summary(mod1)

Call:
lm(formula = mat[, "TTTGCATGAGAGGC", drop = FALSE] ~ mor_mat[, 
    "B cells", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9124 -0.0895 -0.0895 -0.0895  6.8836 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.089525   0.004293  20.852   <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 1.822836   0.205258   8.881   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5027 on 13712 degrees of freedom
Multiple R-squared:  0.005719,  Adjusted R-squared:  0.005646 
F-statistic: 78.87 on 1 and 13712 DF,  p-value: < 2.2e-16
1.822836 / 0.205258 # This equals the T value
[1] 8.880706
mod2 <- lm(mat[, "AAGATGGAGATAAG", drop = FALSE] ~ mor_mat[, "B cells", drop = FALSE])
summary(mod2)

Call:
lm(formula = mat[, "AAGATGGAGATAAG", drop = FALSE] ~ mor_mat[, 
    "B cells", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2980 -0.1471 -0.1471 -0.1471  5.7711 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.147106   0.004526  32.504   <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 0.150863   0.216370   0.697    0.486    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5299 on 13712 degrees of freedom
Multiple R-squared:  3.545e-05, Adjusted R-squared:  -3.747e-05 
F-statistic: 0.4862 on 1 and 13712 DF,  p-value: 0.4857
0.150863 / 0.216370 # This equals the T value
[1] 0.6972455

We can see how mod1 has returned a high coefficient for cell TTTGCATGAGAGGC while mod2 has returned a low coefficient for cell AAGATGGAGATAAG. Moreover, when we look at the T value for the B cell signature we see how in mod1 it is 8.881 while for mod2 it is 0.697.

Lets check that the T values obtained manually actually match those returned by decoupleR

res %>% filter(source == "PAX5" & condition %in% c("TTTGCATGAGAGGC", "AAGATGGAGATAAG"))
# A tibble: 0 × 5
# ℹ 5 variables: statistic <chr>, source <chr>, condition <chr>, score <dbl>,
#   p_value <dbl>

Effectively, they do!

Visualization

We can directly add the ulm scores to an assay in our object and visualize the results

pbmc3k[['pathwaysulm']] <- res %>%
  pivot_wider(
      id_cols = 'source',
      names_from = 'condition',
      values_from = 'score') %>%
  column_to_rownames('source') %>%
  Seurat::CreateAssayObject(.)

# Change assay
DefaultAssay(object = pbmc3k) <- "pathwaysulm"

# Scale the data for comparison across signatures 
pbmc3k <- ScaleData(pbmc3k)
Centering and scaling data matrix
pbmc3k@assays$pathwaysulm@data <- pbmc3k@assays$pathwaysulm@scale.data
UMAP visualization

Plot all the gene signatures one after the other

plt <- FeaturePlot(
    pbmc3k,
    features = rownames(pbmc3k@assays$pathwaysulm),
    ncol = 3) &
    scale_color_viridis_c(option = "magma")

plt + dim_plt

Heatmap by groups

We can also visualize the gene signature scores for each individual cell using a heatmap

DoHeatmap(
    pbmc3k,
    features = rownames(pbmc3k@assays$pathwaysulm),
    slot = "data",
    group.colors = RColorBrewer::brewer.pal(n = 7, name = "Dark2")) +
    scale_fill_viridis_c(option = "viridis")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

From the plot above we can see how we have very distinct populations in our datasets. We can also look at it a bit less granular by looking at the mean activity score per cluster.

# Extract activities from object as a long dataframe
df <- t(as.matrix(pbmc3k@assays$pathwaysulm@data)) %>%
  as.data.frame() %>%
  mutate(cluster = pbmc3k$RNA_snn_res.0.15) %>%
  pivot_longer(cols = -cluster, names_to = "source", values_to = "score") %>%
  group_by(cluster, source) %>%
  summarise(mean = mean(score))
`summarise()` has grouped output by 'cluster'. You can override using the
`.groups` argument.
# Transform to wide matrix
top_acts_mat <- df %>%
  pivot_wider(id_cols = 'cluster', names_from = 'source',
              values_from = 'mean') %>%
  column_to_rownames('cluster') %>%
  as.matrix()
# Choose color palette
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 2.41, and the minimum is -1.05.
my_breaks <- c(seq(-2, 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, 2, length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

Heatmap for gene expression

To fully grasp which genes are driving each gene signature within each cell we want to visualize the gene expression of the genes involved in each gene signature for each cell. We can do so using the ComplexHeatmap package and a little bit of data processing. For ease here is a function you can incorporate in your analysis:

geneHM <- function(
        object,
        sig_df,
        sig_name,
        sig_assay,
        .source,
        .target,
        sig_slot = "data",
        expr_assay = "RNA",
        expr_slot = "data",
        grouping = NULL,
        grouping_color = NULL,
        expr_cols = viridisLite::magma(100)) {
    
    # Extract Gene Expression Matrix from Seurat Object
    gene_expr <- GetAssayData(object, assay = expr_assay, slot = expr_slot)
    
    # Subset the genes of the signature from the Gene Expression Matrix
    genes_of_interest <- sig_df[, .target][which(sig_df[, .source] %in% sig_name)]
    
    # Subset the genes intersecting between gene expression and genes in signature
    g_int <- intersect(rownames(gene_expr), genes_of_interest)
    
    if (length(g_int) < length(genes_of_interest)) {
        genes_excluded <- genes_of_interest[!genes_of_interest %in% rownames(gene_expr)]
        genes_excluded <- paste(genes_excluded, collapse = ", ")
        message(paste0(
            "Genes ", genes_excluded,
            " are in the gene signature but not in the expression matrix,",
            " therefore, they have been excluded."))
    }
    
    # Subset expression matrix to only genes of interest
    gene_expr <- gene_expr[g_int, ]
    
    # Extract the Scores of the Signature of interest
    sig_score <- GetAssayData(object, assay = sig_assay, slot = sig_slot)
    sig_vec <- sig_score[sig_name, ]
    anno <- data.frame(score = sig_vec)
    # Make sure they are in the right order
    anno <- anno[colnames(gene_expr), , drop = FALSE]
    
    # Add any metadata if specified
    if (!is.null(grouping)) {
        meta <- object@meta.data[, grouping, drop = FALSE]
        anno <- cbind(anno, meta[rownames(anno), , drop = FALSE])
    }
    
    if (any(is.infinite(c(anno$score))))
        stop("There are scores with Inf values, please address this outside of this function. It could be because the slot used is scale_data.")
    
    # Make list of color to paint the annotation columns
    if (!is.null(grouping) & !is.null(grouping_color)) {
        score <- circlize::colorRamp2(
            breaks = c(min(anno$score), 0, max(anno$score)),
            colors = c("blue", "white", "red"))
        
        color_ls <- append(grouping_color, score)
        names(color_ls)[length(color_ls)] <- "score"
        
    } else {
        color_ls <- list(
            score = circlize::colorRamp2
            (breaks = c(min(anno$score), 0, max(anno$score)),
                      colors = c("blue", "white", "red")),
            RNA_snn_res.0.15 = clust_color)
    }
    
    # Set the order from most expressing to least expressing 
    ord <- rownames(anno[order(anno$score, decreasing = TRUE), ])
    # Add the score of the signature as annotation in the heatmap
    colAnn <- HeatmapAnnotation(
        df = anno[ord, , drop = FALSE],
        which = 'column',
        col = color_ls)
    
    # Visualize the Heatmap with the genes and signature 
    ht <- Heatmap(
        as.matrix(gene_expr[, ord]),
        name = "Gene Expression",
        col = expr_cols,
        cluster_rows = TRUE,
        cluster_columns = TRUE,
        column_title = sig_name,
        column_names_gp = gpar(fontsize = 14),
        show_column_names = FALSE,
      top_annotation = colAnn)
    
    # Return ComplexHeatmap
    draw(ht)
}

# Define colors for the grouping variable
clust_color <- RColorBrewer::brewer.pal(
    length(unique(pbmc3k$RNA_snn_res.0.15)),
    name = "Dark2")
names(clust_color) <- levels(pbmc3k$RNA_snn_res.0.15)

# Visualize the heatmaps for all signatures
# tt <- lapply(unique(sig_df$signature), function(i) {
#     geneHM(
#         object = pbmc3k,
#         sig_df = sig_df,
#         .source = "signature",
#         .target = "gene",
#         sig_name = i,
#         expr_slot = "data",
#         expr_assay = "RNA",
#         sig_assay = "pathwaysulm",
#         sig_slot = "data",
#         grouping = c("RNA_snn_res.0.15"),
#         grouping_color = list(RNA_snn_res.0.15 = clust_color))
# })

Here are some examples of how to interpret these gene signatures:

  1. In the Monocyte signature not all cells that have the same score have the same genes expressed. For example, wee can see how among the cells with high scores there are cells that express CD14 with a gradient switching to expression of FCGR3A. Therefore, classifying all of these cells as the same just because the gene signature scoring returns the same value would be a mistake. In this case, a likely scenario could be that the gene signature isn’t teasing out differences between classical, intermediate, and non-classical monocytes.
geneHM(
    object = pbmc3k,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "Monocytes",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "pathwaysulm",
    sig_slot = "data",
    grouping = c("RNA_snn_res.0.15"),
    grouping_color = list(RNA_snn_res.0.15 = clust_color))

  1. When looking at the CD8 Cytotoxic compartment we also observe how NK cells have a high score despite not expressing CD3 genes. This can be due to the higher expression of NKG7 i NKs vs CD8 T cells for example.
geneHM(
    object = pbmc3k,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "CD8 Cytotoxic",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "pathwaysulm",
    sig_slot = "data",
    grouping = c("RNA_snn_res.0.15"),
    grouping_color = list(RNA_snn_res.0.15 = clust_color))
Genes TRAC, TRBC1, TRBC2 are in the gene signature but not in the expression matrix, therefore, they have been excluded.

  1. This is more of a dummy example but when assessing the T cell signature c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B") CD8 T cells have a higher score than CD4 T cells. This is due to there being two CD8 genes (A & B) as well as CD8B being expressed at higher levels than CD4. Therefore, we need to keep an eye on these things to better interpret the heterogeneity within our populations.
geneHM(
    object = pbmc3k,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "T cells",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "pathwaysulm",
    sig_slot = "data",
    grouping = c("RNA_snn_res.0.15"),
    grouping_color = list(RNA_snn_res.0.15 = clust_color))
Genes TRAC, TRBC1, TRBC2 are in the gene signature but not in the expression matrix, therefore, they have been excluded.

Briefly and to the point, when scoring gene signatures and looking at their activities it is important to not only look at the oveall score obtained for each cell but one also needs to dive deeper into which are the genes that are driving that signature!

Extra!

In the above example we showed how to compute signature scores using ulm, if we take a closer look to the original decoupleR paper (badia-i-mompel 2022) we can see how in Supplementary Figure 3 ulm and mlm slightly outperform norm_wmean in terms of AUROC and AUPRC. Moreover, in the Bioconductor Vignettes they showcase the use of run_wmean instead of ulm. So… why use ulm instead of norm_wmean?


Run norm_wmean

In this section we are going to show how computing the normalized weighted mean with 1,000 permutations provides a very similar result to the ulm but takes much longer!.

Run decouple on our data using the wmean method. As mentioned in the details of the function: “WMEAN infers regulator activities by first multiplying each target feature by its associated weight which then are summed to an enrichment score wmean. Furthermore, permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score norm_wmean, or a corrected estimate corr_wmean by multiplying wmean by the minus log10 of the obtained empirical p-value.”.

wmean_start <- Sys.time()
res_wmean <- decoupleR::run_wmean(
    mat = pbmc3k@assays$RNA@data,
    network = sig_df,
    .source = "signature",
    .target = "gene",
    .mor = "mor",
    times = 1000)

glue::glue("Time to run norm_wmean with 1000 iterations is {round(difftime(Sys.time(), wmean_start, units = 's'), 0)} seconds")
Time to run norm_wmean with 1000 iterations is 96 seconds
res_wmean
# A tibble: 48,600 × 5
   statistic  source  condition      score p_value
   <chr>      <chr>   <chr>          <dbl>   <dbl>
 1 corr_wmean B cells AAACATACAACCAC 0       0.576
 2 corr_wmean B cells AAACATTGAGCTAC 5.74    0.002
 3 corr_wmean B cells AAACATTGATCAGC 0       0.84 
 4 corr_wmean B cells AAACCGTGCTTCCG 0.549   0.13 
 5 corr_wmean B cells AAACCGTGTATGCG 0       0.422
 6 corr_wmean B cells AAACGCACTGGTAC 0.807   0.064
 7 corr_wmean B cells AAACGCTGACCAGT 0       0.59 
 8 corr_wmean B cells AAACGCTGGTTCTT 0       0.644
 9 corr_wmean B cells AAACGCTGTAGCCA 0.297   0.152
10 corr_wmean B cells AAACGCTGTTTCTG 0.294   0.172
# ℹ 48,590 more rows

We obtain a long format tibble containing:

  • statistic - Indicating which method is associated with each score

    • wmean: multiplying each target feature by its associated weight which then are summed to an enrichment score

    • norm_wmean: permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score norm_wmean

    • corr_wmean: corrected estimate by multiplying wmean by the minus log10 of the obtained empirical p-value

  • source (aka - signature name)

  • condition - cell barcode

  • score - the signature score, the inferred biological activity.

  • p_value - P value obtained from permutations

Compare ulm with norm_wmean scores

res2 <- res %>%
    dplyr::select(statistic, source, score, condition)
colnames(res2) <- glue::glue("{colnames(res2)}_ulm")

res_wmean2 <- res_wmean %>%
    dplyr::filter(statistic == "norm_wmean") %>%
    dplyr::select(statistic, source, score, condition)
colnames(res_wmean2) <- glue::glue("{colnames(res_wmean2)}_wmean")

res2 %>%
    left_join(
        res_wmean2,
        by = c("condition_ulm" = "condition_wmean", "source_ulm" = "source_wmean")) %>%
    ggplot(aes(x = score_ulm, y = score_wmean)) +
    geom_point() +
    facet_wrap(~source_ulm, scales = "free") +
    stat_smooth(method = "lm", formula = y ~ x, geom = "smooth") +
    ggpubr::stat_cor(method = "pearson") +
    labs(x = "ulm", y = "norm_wmean") +
    theme_minimal()

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ComplexHeatmap_2.16.0   decoupleR_2.5.3         pbmc3k.SeuratData_3.0.0
 [4] ggrepel_0.9.3           ggpmisc_0.5.4-1         ggpp_0.5.4             
 [7] DT_0.29                 SeuratObject_4.1.3      Seurat_4.3.0.1         
[10] lubridate_1.9.2         forcats_1.0.0           stringr_1.5.0          
[13] dplyr_1.1.3             purrr_1.0.2             readr_2.1.4            
[16] tidyr_1.3.0             tibble_3.2.1            ggplot2_3.4.3          
[19] tidyverse_2.0.0        

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     shape_1.4.6            rstudioapi_0.15.0     
  [4] jsonlite_1.8.7         magrittr_2.0.3         magick_2.7.5          
  [7] spatstat.utils_3.0-3   farver_2.1.1           rmarkdown_2.24        
 [10] GlobalOptions_0.1.2    vctrs_0.6.3            ROCR_1.0-11           
 [13] Cairo_1.6-1            spatstat.explore_3.2-1 rstatix_0.7.2         
 [16] htmltools_0.5.6        polynom_1.4-1          broom_1.0.5           
 [19] sctransform_0.3.5      parallelly_1.36.0      KernSmooth_2.23-22    
 [22] htmlwidgets_1.6.2      ica_1.0-3              plyr_1.8.8            
 [25] plotly_4.10.2          zoo_1.8-12             igraph_1.5.1          
 [28] iterators_1.0.14       mime_0.12              lifecycle_1.0.3       
 [31] pkgconfig_2.0.3        Matrix_1.6-1           R6_2.5.1              
 [34] fastmap_1.1.1          clue_0.3-64            fitdistrplus_1.1-11   
 [37] future_1.33.0          shiny_1.7.5            digest_0.6.33         
 [40] colorspace_2.1-0       S4Vectors_0.38.1       patchwork_1.1.3       
 [43] confintr_1.0.2         tensor_1.5             irlba_2.3.5.1         
 [46] ggpubr_0.6.0           labeling_0.4.3         progressr_0.14.0      
 [49] fansi_1.0.4            spatstat.sparse_3.0-2  timechange_0.2.0      
 [52] mgcv_1.9-0             httr_1.4.7             polyclip_1.10-4       
 [55] abind_1.4-5            compiler_4.3.1         withr_2.5.0           
 [58] doParallel_1.0.17      backports_1.4.1        carData_3.0-5         
 [61] ggsignif_0.6.4         MASS_7.3-60            quantreg_5.97         
 [64] rjson_0.2.21           tools_4.3.1            lmtest_0.9-40         
 [67] httpuv_1.6.11          future.apply_1.11.0    goftest_1.2-3         
 [70] glue_1.6.2             nlme_3.1-163           promises_1.2.1        
 [73] Rtsne_0.16             cluster_2.1.4          reshape2_1.4.4        
 [76] generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-1   
 [79] tzdb_0.4.0             data.table_1.14.8      hms_1.1.3             
 [82] car_3.1-2              sp_2.0-0               utf8_1.2.3            
 [85] BiocGenerics_0.46.0    spatstat.geom_3.2-4    RcppAnnoy_0.0.21      
 [88] foreach_1.5.2          RANN_2.6.1             pillar_1.9.0          
 [91] later_1.3.1            circlize_0.4.15        splines_4.3.1         
 [94] lattice_0.21-8         survival_3.5-7         deldir_1.0-9          
 [97] SparseM_1.81           tidyselect_1.2.0       miniUI_0.1.1.1        
[100] pbapply_1.7-2          knitr_1.43             gridExtra_2.3         
[103] IRanges_2.34.1         scattermore_1.2        stats4_4.3.1          
[106] xfun_0.40              matrixStats_1.0.0      stringi_1.7.12        
[109] lazyeval_0.2.2         yaml_2.3.7             evaluate_0.21         
[112] codetools_0.2-19       cli_3.6.1              uwot_0.1.16           
[115] xtable_1.8-4           reticulate_1.32.0.9002 munsell_0.5.0         
[118] Rcpp_1.0.11            globals_0.16.2         spatstat.random_3.1-5 
[121] png_0.1-8              parallel_4.3.1         MatrixModels_0.5-2    
[124] ellipsis_0.3.2         listenv_0.9.0          viridisLite_0.4.2     
[127] scales_1.2.1           ggridges_0.5.4         crayon_1.5.2          
[130] leiden_0.4.3           GetoptLong_1.0.5       rlang_1.1.1           
[133] cowplot_1.1.1         

References

Badia-i-Mompel, Pau, Jesús Vélez Santiago, Jana Braunger, Celina Geiss, Daniel Dimitrov, Sophia Müller-Dott, Petr Taus, et al. 2022. “decoupleR: Ensemble of Computational Methods to Infer Biological Activities from Omics Data.” Edited by Marieke Lydia Kuijjer. Bioinformatics Advances 2 (1). https://doi.org/10.1093/bioadv/vbac016.
Mallik, Saurav, and Zhongming Zhao. 2018. “Identification of Gene Signatures from RNA-Seq Data Using Pareto-Optimal Cluster Algorithm.” BMC Systems Biology 12 (S8). https://doi.org/10.1186/s12918-018-0650-2.